1 Introduction


Understanding the Adam optimizer requires a solid grasp of several foundational concepts 
in machine learning and optimization algorithms. In this comprehensive guide, we start 
from the basics and build up to the full Adam optimizer, illustrating each component 
with multiple detailed examples. The topics covered include: 


• Basic Concepts
    - Scalars, Vectors, and Matrices
    - Functions and Gradients
    - Learning Rate and Gradient Descent

• Backpropagation
    - Chain Rule in Calculus
    - Computing Gradients in Neural Networks
    - Examples of Backpropagation

• Optimization Algorithms
    - Stochastic Gradient Descent (SGD)
    - Momentum
    - Adaptive Learning Rates

• Adam Optimizer Detailed Mechanics
    - Mathematical Formulation
    - Step-by-Step Examples of Each Component
    - AdamW Variant

• Adam in Task Switching and Memory Retention
    - How Momentum Affects Learning
    - Examples Illustrating Memory Retention

• Summary and Insights


2 Basic Concepts 


2.1 Scalars, Vectors, and Matrices 


Scalar: A single number (e.g., $a = 5$).
Vector: An ordered list of numbers (e.g., $v = [v_1, v_2, v_3]$).
Matrix: A two-dimensional array of numbers (e.g., $M$ is a $3 \times 3$ matrix).


2.2 Functions and Gradients 


Function: A mapping from input to output (e.g., $f(x) = x^2$).
Gradient: A vector of partial derivatives. It points in the direction of fastest increase,
and its magnitude gives the rate of that increase.


2.2.1 Example 1: Scalar Function

$$f(x) = x^2$$

Derivative:

$$f'(x) = 2x$$

2.2.2 Example 2: Multivariate Function

$$f(x, y) = x^2 + y^2$$

Gradient:

$$\nabla f = [2x, 2y]$$

2.2.3 Example 3: More Complex Function

$$f(x, y, z) = xyz$$

Gradient:

$$\nabla f = [yz, xz, xy]$$
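
An analytic gradient such as the one in Example 3 can be sanity-checked numerically. The sketch below is only an illustration: the helper `numerical_gradient`, the step size `h`, and the test point are assumptions made here, not part of the original examples. It compares the analytic gradient of $f(x, y, z) = xyz$ with a central-difference estimate.

```python
def numerical_gradient(f, point, h=1e-6):
    """Estimate the gradient of f at `point` with central differences."""
    grad = []
    for i in range(len(point)):
        plus = list(point)
        minus = list(point)
        plus[i] += h
        minus[i] -= h
        grad.append((f(*plus) - f(*minus)) / (2 * h))
    return grad


def f(x, y, z):
    return x * y * z


def grad_f(x, y, z):
    # Analytic gradient from Example 3: [yz, xz, xy]
    return [y * z, x * z, x * y]


point = [1.0, 2.0, 3.0]  # arbitrary test point (an assumption)
analytic = grad_f(*point)
numeric = numerical_gradient(f, point)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
print("analytic gradient:", analytic)
print("max abs difference from numeric estimate:", max_error)
```

If the two disagree by much more than rounding error, the analytic derivative is the first thing to re-check.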


2.3 Learning Rate and Gradient Descent 


Learning Rate ($\alpha$): A hyperparameter that controls the step size during optimization.
Gradient Descent Update Rule:


$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla_\theta L(\theta)$$


2.3.1 Example 1: Minimizing $f(x) = x^2$


Starting Point: $x_0 = 4$
Learning Rate: $\alpha = 0.1$
Gradient: $f'(x) = 2x$
Update:
$$x_{\text{new}} = x_{\text{old}} - \alpha \cdot 2x_{\text{old}}$$


Step 1:
$$x_1 = 4 - 0.1 \cdot 8 = 4 - 0.8 = 3.2$$


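2.3.2 Example 2: Minimizing $f(x, y) = x^2 + y^2$


Starting Point: $[x_0, y_0] = [3, 4]$
Learning Rate: $\alpha = 0.1$
Gradient: $\nabla f = [2x, 2y]$
Update:
$$x_{\text{new}} = x_{\text{old}} - \alpha \cdot 2x_{\text{old}}$$

$$y_{\text{new}} = y_{\text{old}} - \alpha \cdot 2y_{\text{old}}$$


Step 1:
$$x_1 = 3 - 0.1 \cdot 6 = 3 - 0.6 = 2.4$$

$$y_1 = 4 - 0.1 \cdot 8 = 4 - 0.8 = 3.2$$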
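
The update above can be repeated in a loop. The following is a minimal sketch, assuming the starting point $[3, 4]$ and $\alpha = 0.1$ from this example; the function name `gradient_descent` and the choice of 20 steps are illustrative assumptions.

```python
def gradient_descent(grad, start, lr, steps):
    """Plain gradient descent; returns the list of visited points."""
    point = list(start)
    history = [tuple(point)]
    for _ in range(steps):
        g = grad(point)
        point = [p - lr * gi for p, gi in zip(point, g)]
        history.append(tuple(point))
    return history


def grad_f(p):
    # Gradient of f(x, y) = x^2 + y^2
    return [2 * p[0], 2 * p[1]]


history = gradient_descent(grad_f, start=[3.0, 4.0], lr=0.1, steps=20)
print("after step 1:", history[1])    # close to (2.4, 3.2), as in Step 1
print("after step 20:", history[-1])  # both coordinates shrink toward 0
```

With this learning rate each coordinate is multiplied by $1 - 2\alpha = 0.8$ per step, so the iterates contract steadily toward the minimum at the origin.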


2.3.3 Example 3: Minimizing $f(\theta) = \theta^2$ with Different Learning Rates


If $\alpha$ is too large, we might overshoot the minimum.
If $\alpha = 1$, starting at $\theta = 1$:


Gradient: $2 \cdot 1 = 2$
Update: $\theta_{\text{new}} = 1 - 1 \cdot 2 = -1$
Next Gradient: $2 \cdot (-1) = -2$
Update: $\theta_{\text{new}} = -1 - 1 \cdot (-2) = 1$


The parameter oscillates between 1 and -1 without converging. 
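
The effect of the learning rate can be seen by running the same update with several values of $\alpha$. This is a small sketch: $\alpha = 1$ is the case worked above, while $\alpha = 0.1$ and $\alpha = 1.1$ are extra values chosen here purely for comparison.

```python
def run(lr, theta=1.0, steps=6):
    """Gradient descent on f(theta) = theta^2, whose gradient is 2 * theta."""
    trajectory = [theta]
    for _ in range(steps):
        theta = theta - lr * 2 * theta
        trajectory.append(theta)
    return trajectory


oscillating = run(lr=1.0)  # the case from the text: flips between 1 and -1
converging = run(lr=0.1)   # assumed smaller rate: shrinks toward 0
diverging = run(lr=1.1)    # assumed larger rate: magnitude grows each step

print("lr=1.0:", oscillating)
print("lr=0.1:", [round(t, 4) for t in converging])
print("lr=1.1:", [round(t, 4) for t in diverging])
```

For this function each step multiplies $\theta$ by $1 - 2\alpha$, so the iterates converge only when $|1 - 2\alpha| < 1$.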


3 Backpropagation 


3.1 Chain Rule in Calculus 


Chain Rule: If a variable $z$ depends on $y$, which in turn depends on $x$, then:


$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$


3.1.1 Example 1: Simple Chain


$$y = x^2$$
$$z = \sin(y)$$
$$\frac{dz}{dx} = \cos(y) \cdot 2x$$


3.1.2 Example 2: Multiple Layers


$$y = x^2$$
$$z = y^3$$
$$\frac{dz}{dx} = 3y^2 \cdot 2x = 6xy^2$$


3.1.3 Example 3: Function Composition


$$y = \ln(x)$$
$$z = e^y$$
$$\frac{dz}{dx} = e^y \cdot \frac{1}{x} = \frac{e^{\ln(x)}}{x} = \frac{x}{x} = 1$$
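
The three chain-rule results above can be checked against a finite-difference estimate. The sketch below is illustrative only: the helper `derivative`, the step size, and the evaluation point $x = 1.5$ are assumptions made here.

```python
import math


def derivative(f, x, h=1e-6):
    """Central-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)


x = 1.5  # arbitrary evaluation point (must be positive for ln)
y = x ** 2

# Example 1: y = x^2, z = sin(y)  ->  dz/dx = cos(y) * 2x
chain_1 = math.cos(y) * 2 * x
numeric_1 = derivative(lambda u: math.sin(u ** 2), x)

# Example 2: y = x^2, z = y^3  ->  dz/dx = 3y^2 * 2x = 6xy^2
chain_2 = 6 * x * y ** 2
numeric_2 = derivative(lambda u: (u ** 2) ** 3, x)

# Example 3: y = ln(x), z = e^y  ->  dz/dx = e^y * (1/x) = 1
chain_3 = math.exp(math.log(x)) / x
numeric_3 = derivative(lambda u: math.exp(math.log(u)), x)

for name, c, n in [("ex1", chain_1, numeric_1),
                   ("ex2", chain_2, numeric_2),
                   ("ex3", chain_3, numeric_3)]:
    print(name, "chain rule:", round(c, 6), "numeric:", round(n, 6))
```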


3.2 Computing Gradients in Neural Networks 
Neural Network: Composed of layers with weights and activation functions. 

Objective: Compute the gradient of the loss with respect to each weight. 
3.2.1 Example 1: Single Neuron 
Input: $x$, Weight: $w$

$$y = wx$$
$$L = (y - t)^2$$

where $t$ is the target.


Gradient:
$$\frac{dL}{dw} = \frac{dL}{dy} \cdot \frac{dy}{dw} = 2(y - t) \cdot x$$

3.2.2 Example 2: Two-Layer Network 
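Layer 1:
$$y_1 = w_1 x$$
Activation:
$$a = \sigma(y_1)$$
(e.g., sigmoid)
Layer 2:
$$y_2 = w_2 a$$
Loss:
$$L = (y_2 - t)^2$$


Gradient w.r.t. $w_1$:
$$\frac{dL}{dw_1} = \frac{dL}{dy_2} \cdot \frac{dy_2}{da} \cdot \frac{da}{dy_1} \cdot \frac{dy_1}{dw_1}$$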


3.2.3 Example 3: General Backpropagation Step 


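For any weight $w$ in the network:


$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y_{\text{out}}} \cdot \prod \frac{\partial y_{\text{next}}}{\partial y_{\text{prev}}} \cdot \frac{\partial y_{\text{prev}}}{\partial w}$$

where the product runs over the layers between the one containing $w$ and the output.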


3.3 Examples of Backpropagation 


3.3.1 Example 1: Simple Network 
Input: $x = 1$, Target: $t = 2$, Weight: $w = 0.5$


$$y = wx = 0.5 \times 1 = 0.5$$
$$L = (y - t)^2 = (0.5 - 2)^2 = 2.25$$


Gradient:
$$\frac{dL}{dw} = 2(y - t)x = 2(-1.5)(1) = -3$$


Update with Learning Rate $\alpha = 0.1$:


$$w_{\text{new}} = w - \alpha \frac{dL}{dw} = 0.5 - 0.1(-3) = 0.5 + 0.3 = 0.8$$
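
The forward pass, gradient, and update of this example fit in a few lines. This is a minimal sketch using the numbers above; the function names are illustrative assumptions.

```python
def forward(w, x):
    return w * x


def loss(y, t):
    return (y - t) ** 2


def grad_w(w, x, t):
    # dL/dw = dL/dy * dy/dw = 2(y - t) * x
    return 2 * (forward(w, x) - t) * x


x, t, w, lr = 1.0, 2.0, 0.5, 0.1
y = forward(w, x)
L = loss(y, t)
g = grad_w(w, x, t)
w_new = w - lr * g
print("y =", y, "L =", L, "dL/dw =", g, "w_new =", round(w_new, 4))
```

Repeating the update moves $w$ toward 2, the value at which $wx$ equals the target for $x = 1$.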


3.3.2 Example 2: Non-linear Activation 


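Activation Function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Forward Pass (with $x = 1$, $w = 0.5$, $t = 1$):
$$z = wx = 0.5 \times 1 = 0.5$$

$$a = \sigma(z) \approx 0.6225$$
$$L = (a - t)^2 = (0.6225 - 1)^2 \approx 0.1425$$
Backpropagation:


$$\frac{dL}{da} = 2(a - t) = 2(-0.3775) = -0.755$$
$$\frac{da}{dz} = \sigma(z)(1 - \sigma(z)) \approx 0.235$$
$$\frac{dz}{dw} = x = 1$$
$$\frac{dL}{dw} = \frac{dL}{da} \cdot \frac{da}{dz} \cdot \frac{dz}{dw} \approx -0.755 \times 0.235 \times 1 \approx -0.177$$

Update with Learning Rate $\alpha = 0.1$:

$$w_{\text{new}} = 0.5 - 0.1(-0.177) = 0.5 + 0.0177 = 0.5177$$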
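
The same calculation can be reproduced in code. The sketch below assumes the logistic sigmoid as the activation (consistent with $\sigma(0.5) \approx 0.6225$) and uses $x = 1$, $t = 1$, $w = 0.5$ and a learning rate of 0.1, as in the arithmetic above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


x, t, w, lr = 1.0, 1.0, 0.5, 0.1

# Forward pass
z = w * x
a = sigmoid(z)
L = (a - t) ** 2

# Backward pass: one factor per link in the chain rule
dL_da = 2 * (a - t)
da_dz = a * (1 - a)  # derivative of the logistic sigmoid
dz_dw = x
dL_dw = dL_da * da_dz * dz_dw

w_new = w - lr * dL_dw
print("a =", round(a, 4), "L =", round(L, 4))
print("dL/dw =", round(dL_dw, 4), "w_new =", round(w_new, 4))
```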


3.3.3 Example 3: Two Inputs 
Inputs: $x_1 = 1$, $x_2 = 2$
Weights: $w_1 = 0.5$, $w_2 = -0.5$
$$y = w_1 x_1 + w_2 x_2$$
Target: $t = 1$
$$L = (y - t)^2$$
Gradients:


$$\frac{dL}{dw_1} = 2(y - t)x_1$$

$$\frac{dL}{dw_2} = 2(y - t)x_2$$
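
The gradients above are left symbolic; plugging in the given inputs and weights yields concrete values. This is a small sketch, and the helper name `gradients` is an assumption introduced here.

```python
def gradients(weights, inputs, target):
    """Return (y, loss, [dL/dw_i]) for y = sum(w_i * x_i), L = (y - t)^2."""
    y = sum(w * x for w, x in zip(weights, inputs))
    error = y - target
    return y, error ** 2, [2 * error * x for x in inputs]


y, L, grads = gradients(weights=[0.5, -0.5], inputs=[1.0, 2.0], target=1.0)
print("y =", y, "L =", L, "gradients =", grads)
```

Each weight's gradient is the shared error term $2(y - t)$ scaled by that weight's own input, so the weight attached to the larger input receives the larger gradient.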


4 Optimization Algorithms 


4.1 Stochastic Gradient Descent (SGD) 


SGD Update Rule:
$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla_\theta L(\theta)$$


Stochastic: Uses a random subset (mini-batch) of data for each update. 


4.1.1 Example 1: SGD with Mini-Batch Size 1 


Dataset: $\{(x_i, t_i)\}$
For each sample:


a) Compute gradient $\nabla_\theta L(\theta; x_i, t_i)$
b) Update $\theta$


4.1.2 Example 2: SGD with Mini-Batch Size 4 


Use 4 samples to compute an average gradient. Update parameters once per mini-batch. 


4.1.3 Example 3: Comparison with Batch Gradient Descent 


Batch Gradient Descent (Batch GD): Uses the entire dataset to compute gradients. 
Stochastic Gradient Descent (SGD): Updates more frequently with noisier gradients.
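
The three batching regimes differ only in how many samples feed each update. The sketch below is a hedged illustration on a hypothetical noiseless dataset generated by $t = 3x$; the function `minibatch_sgd`, the learning rate, the epoch count, and the data are all assumptions chosen for the demonstration.

```python
import random


def minibatch_sgd(data, w, lr, batch_size, epochs, seed=0):
    """Fit y = w * x with squared error, one update per mini-batch."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average gradient of (w*x - t)^2 over the mini-batch
            grad = sum(2 * (w * x - t) * x for x, t in batch) / len(batch)
            w -= lr * grad
    return w


# Hypothetical dataset: 20 points on the line t = 3x
data = [(k / 10, 3 * k / 10) for k in range(1, 21)]

w_single = minibatch_sgd(data, w=0.0, lr=0.05, batch_size=1, epochs=50)
w_mini = minibatch_sgd(data, w=0.0, lr=0.05, batch_size=4, epochs=50)
w_batch = minibatch_sgd(data, w=0.0, lr=0.05, batch_size=len(data), epochs=50)
print("batch size 1: ", round(w_single, 4))
print("batch size 4: ", round(w_mini, 4))
print("full batch:   ", round(w_batch, 4))
```

All three runs see the same number of epochs, but batch size 1 performs 20 updates per epoch while full-batch gradient descent performs one. Because this toy data has no noise, the per-sample gradients all agree on the optimum; with real data the smaller batches would give noisier steps.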


4.2 Momentum 


Purpose: Accelerate convergence and help escape shallow local minima by adding a fraction
of the previous update to the current update.
Update Rule:
$$v_t = \beta v_{t-1} + (1 - \beta) \nabla_\theta L(\theta)$$


$$\theta_t = \theta_{t-1} - \alpha v_t$$


where $\beta$ is the momentum coefficient (e.g., 0.9).


4.2.1 Example 1: Momentum Update 
$$v_0 = 0$$


At step 1:
$$v_1 = 0.9 \times 0 + 0.1 \times g_1 = 0.1 g_1$$


$$\theta_1 = \theta_0 - \alpha v_1$$
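
A short simulation makes the momentum update concrete. The sketch below applies the rule with $\beta = 0.9$ to $f(\theta) = \theta^2$; the starting point, learning rate, and step count are assumptions chosen for illustration.

```python
def momentum_descent(grad, theta, lr, beta, steps):
    """Gradient descent with an exponential moving average of gradients."""
    v = 0.0
    trajectory = [theta]
    for _ in range(steps):
        g = grad(theta)
        v = beta * v + (1 - beta) * g  # v_t = beta * v_{t-1} + (1 - beta) * g_t
        theta = theta - lr * v         # theta_t = theta_{t-1} - alpha * v_t
        trajectory.append(theta)
    return trajectory


trajectory = momentum_descent(lambda th: 2 * th, theta=4.0, lr=0.1,
                              beta=0.9, steps=200)
print("after step 1:", trajectory[1])         # v_1 = 0.1 * g_1 = 0.8
print("lowest value reached:", min(trajectory))
print("final value:", trajectory[-1])
```

In this run the velocity keeps carrying the parameter past zero before the iterates settle, which is the overshooting behaviour discussed in Example 3 of this section.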


4.2.2 Example 2: Accelerated Descent 


In a valley-shaped loss function, momentum helps roll down faster. 


4.2.3 Example 3: Overshooting Minima 

Momentum can cause the parameter to overshoot the minimum and oscillate before settling.

4.3 Adaptive Learning Rates 


Idea: Adjust the learning rate for each parameter based on the history of gradients. 
Algorithms: 


• AdaGrad: Adjusts learning rates inversely proportional to the square root of the
sum of squared gradients.
$$r_t = r_{t-1} + g_t^2$$


$$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{r_t} + \epsilon} g_t$$

4.3.1 Example 1: AdaGrad with Sparse Gradients 


Parameters with infrequent updates get larger learning rates. 


4.3.2 Example 2: AdaGrad Accumulating Gradient Squares 


After many updates, learning rates become very small, slowing down learning. 


4.3.3 Example 3: Limitations of AdaGrad 


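Because the accumulated sum of squared gradients never shrinks, the effective learning rate
decays toward zero. On non-convex problems this can stall training before a good solution
is reached.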
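
The behaviours described in these three examples can be observed by tracking AdaGrad's effective learning rate $\alpha / (\sqrt{r_t} + \epsilon)$. This is a sketch under assumptions: the gradient sequences (a dense stream of ones and a sparse stream with one nonzero gradient every ten steps) and $\alpha = 0.1$ are invented for the illustration.

```python
import math


def adagrad(grads, theta, lr=0.1, eps=1e-8):
    """Scalar AdaGrad; returns the final theta and the per-step effective rates."""
    r = 0.0
    effective_rates = []
    for g in grads:
        r += g ** 2                          # accumulate squared gradients
        rate = lr / (math.sqrt(r) + eps)     # effective learning rate
        theta -= rate * g
        effective_rates.append(rate)
    return theta, effective_rates


dense = [1.0] * 100
sparse = [1.0 if i % 10 == 0 else 0.0 for i in range(100)]

_, dense_rates = adagrad(dense, theta=0.0)
_, sparse_rates = adagrad(sparse, theta=0.0)

print("dense:  first rate", round(dense_rates[0], 4),
      "last rate", round(dense_rates[-1], 4))
print("sparse: first rate", round(sparse_rates[0], 4),
      "last rate", round(sparse_rates[-1], 4))
```

The rarely updated parameter keeps a larger effective rate, while in both cases the rate can only go down over time.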


5 Adam Optimizer Detailed Mechanics 


Adam combines momentum and adaptive learning rates to provide efficient optimization. 


5.1 Mathematical Formulation 


Initialize:
$$m_0 = 0 \quad \text{(first moment vector)}$$
$$v_0 = 0 \quad \text{(second moment vector)}$$
$$t = 0 \quad \text{(time step)}$$
Hyperparameters:


• $\alpha$: Learning rate

• $\beta_1$: Decay rate for the first moment (e.g., 0.9)

• $\beta_2$: Decay rate for the second moment (e.g., 0.999)

• $\epsilon$: Small number to prevent division by zero (e.g., $10^{-8}$)


Update Rules: 


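1. Increment Time Step:
$$t \leftarrow t + 1$$


2. Compute Gradient:
$$g_t = \nabla_\theta L_t(\theta_{t-1})$$


3. Update Biased First Moment Estimate:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

4. Update Biased Second Moment Estimate:
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$


5. Compute Bias-Corrected First Moment Estimate:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$

6. Compute Bias-Corrected Second Moment Estimate:
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

7. Update Parameters:
$$\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$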
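
The seven steps above translate almost line for line into code. The following is a minimal scalar sketch, not a reference implementation: the function name `adam_step`, the test function $f(\theta) = \theta^2$, the starting value, and the larger learning rate used in the loop are assumptions made for the demonstration.

```python
import math


def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (theta, m, v, t)."""
    t += 1                                   # 1. increment time step
    m = beta1 * m + (1 - beta1) * g          # 3. biased first moment
    v = beta2 * v + (1 - beta2) * g * g      # 4. biased second moment
    m_hat = m / (1 - beta1 ** t)             # 5. bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # 6. bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)  # 7. update
    return theta, m, v, t


# A single step from theta = 1 with gradient 2 moves by roughly lr
first_theta, _, _, _ = adam_step(1.0, 2.0, 0.0, 0.0, 0)
print("after one step:", first_theta)

# Minimize f(theta) = theta^2 (gradient 2 * theta); step 2 is done by the caller
theta, m, v, t = 1.0, 0.0, 0.0, 0
for _ in range(5000):
    theta, m, v, t = adam_step(theta, 2 * theta, m, v, t, lr=0.01)
print("after 5000 steps:", theta)
```

Note that the first update has magnitude close to $\alpha$ regardless of the gradient's scale, because $\hat{m}_1 / \sqrt{\hat{v}_1}$ equals $g_1 / |g_1|$.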


6 Step-by-Step Examples of Each Component 


Assumptions for Examples: 


$$\alpha = 0.001$$
$$\beta_1 = 0.9$$
$$\beta_2 = 0.999$$
$$\epsilon = 10^{-8}$$


6.0.1 Example 1: Single Parameter Optimization 


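Suppose we have a parameter $\theta$ and gradients at each time step:
Time Step 1:
$$g_1 = 0.1$$


Step-by-Step Calculations:
Increment Time Step:
$$t = 1$$

Compute $m_1$:
$$m_1 = \beta_1 m_0 + (1 - \beta_1) g_1 = 0.9 \times 0 + 0.1 \times 0.1 = 0.01$$
Compute $v_1$:
$$v_1 = \beta_2 v_0 + (1 - \beta_2) g_1^2 = 0.999 \times 0 + 0.001 \times (0.1)^2 = 0.00001$$


Bias Correction:
$$\hat{m}_1 = \frac{m_1}{1 - \beta_1^1} = \frac{0.01}{1 - 0.9^1} = \frac{0.01}{0.1} = 0.1$$
$$\hat{v}_1 = \frac{v_1}{1 - \beta_2^1} = \frac{0.00001}{1 - 0.999^1} = \frac{0.00001}{0.001} = 0.01$$

Parameter Update:
$$\theta_1 = \theta_0 - \frac{\alpha \hat{m}_1}{\sqrt{\hat{v}_1} + \epsilon} = \theta_0 - \frac{0.001 \times 0.1}{0.1 + 10^{-8}} \approx \theta_0 - 0.001$$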


6.0.2 Example 2: Continuing to Time Step 2 


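Assuming $g_2 = 0.2$
Increment Time Step:
$$t = 2$$

Compute $m_2$:
$$m_2 = \beta_1 m_1 + (1 - \beta_1) g_2 = 0.9 \times 0.01 + 0.1 \times 0.2 = 0.009 + 0.02 = 0.029$$
Compute $v_2$:
$$v_2 = \beta_2 v_1 + (1 - \beta_2) g_2^2 = 0.999 \times 0.00001 + 0.001 \times (0.2)^2 = 0.00000999 + 0.00004 = 0.00004999$$


Bias Correction:
$$\hat{m}_2 = \frac{m_2}{1 - \beta_1^2} = \frac{0.029}{1 - 0.9^2} = \frac{0.029}{0.19} \approx 0.1526$$
$$\hat{v}_2 = \frac{v_2}{1 - \beta_2^2} = \frac{0.00004999}{1 - 0.999^2} = \frac{0.00004999}{0.001999} \approx 0.025$$

Parameter Update:
$$\theta_2 = \theta_1 - \frac{\alpha \hat{m}_2}{\sqrt{\hat{v}_2} + \epsilon} = \theta_1 - \frac{0.001 \times 0.1526}{0.1581 + 10^{-8}} \approx \theta_1 - 0.001 \times 0.965 \approx \theta_1 - 0.000965$$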


6.0.3 Example 3: Third Time Step 


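Assuming $g_3 = -0.1$
Increment Time Step:
$$t = 3$$

Compute $m_3$:
$$m_3 = \beta_1 m_2 + (1 - \beta_1) g_3 = 0.9 \times 0.029 + 0.1 \times (-0.1) = 0.0261 - 0.01 = 0.0161$$
Compute $v_3$:
$$v_3 = \beta_2 v_2 + (1 - \beta_2) g_3^2 = 0.999 \times 0.00004999 + 0.001 \times (-0.1)^2 = 0.00004994 + 0.00001 = 0.00005994$$

Bias Correction:
$$\hat{m}_3 = \frac{m_3}{1 - \beta_1^3} = \frac{0.0161}{1 - 0.9^3} = \frac{0.0161}{0.271} \approx 0.0594$$
$$\hat{v}_3 = \frac{v_3}{1 - \beta_2^3} = \frac{0.00005994}{1 - 0.999^3} = \frac{0.00005994}{0.002997} \approx 0.02$$

Parameter Update:
$$\theta_3 = \theta_2 - \frac{\alpha \hat{m}_3}{\sqrt{\hat{v}_3} + \epsilon} = \theta_2 - \frac{0.001 \times 0.0594}{0.1414 + 10^{-8}} \approx \theta_2 - 0.001 \times 0.42 \approx \theta_2 - 0.00042$$


Observation:


• The parameter updates are influenced by both the gradient and the history of gradients.


• The bias correction compensates for initializing $m_0$ and $v_0$ at zero, which would
otherwise bias the moment estimates toward zero in the early steps.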
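
The three worked time steps can be replayed in a short loop. The sketch below uses the gradients 0.1, 0.2, and -0.1 with $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and assumes $\epsilon = 10^{-8}$. It additionally computes the step that would result without bias correction, which is an extra quantity added here for comparison and not part of the examples.

```python
import math

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
m = v = 0.0
steps = []        # bias-corrected update sizes, one per time step
uncorrected = []  # what the update would be using raw m and v

for t, g in enumerate([0.1, 0.2, -0.1], start=1):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    steps.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    uncorrected.append(lr * m / (math.sqrt(v) + eps))
    print(f"t={t}: m={m:.4f} v={v:.8f} m_hat={m_hat:.4f} "
          f"v_hat={v_hat:.4f} step={steps[-1]:.6f}")
```

Each `step` is the amount subtracted from $\theta$ at that time step. Comparing `steps` with `uncorrected` shows how much the raw, zero-initialized moments would distort the early updates.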


7 AdamW Variant 


7.1 Weight Decay Regularization 


Purpose: Prevent overfitting by penalizing large weights. 
In AdamW, weight decay is decoupled from the gradient updates. 


Update Rule: 
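$$\theta_t = \theta_{t-1} - \alpha \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)$$


where $\lambda$ is the weight decay coefficient.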


7.2 Examples with AdamW 


7.2.1 Example 1: Including Weight Decay 
Parameters:
$$\lambda = 0.01$$
At Time Step 1:


$$\theta_1 = \theta_0 - 0.001 \left( \frac{\hat{m}_1}{\sqrt{\hat{v}_1} + \epsilon} + 0.01\,\theta_0 \right)$$

This penalizes the weight $\theta_0$.

7.2.2 Example 2: Effect on Parameter Norms 
Weight decay reduces the magnitude of weights over time, helping in generalization by 
simplifying the model. 
7.2.3 Example 3: Comparison with L2 Regularization in Adam 


In traditional Adam with L2 regularization, weight decay is coupled with gradient updates, 
which can interfere with adaptive learning rates. AdamW’s decoupling provides better 
control. 
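
The difference between coupled and decoupled weight decay shows up even in a single step. The sketch below is illustrative: it reuses $g_1 = 0.1$ and $\lambda = 0.01$ from the examples, assumes $\epsilon = 10^{-8}$ and a hypothetical starting weight $\theta_0 = 1$, and the function names are invented for this comparison.

```python
import math


def adam_update(g, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return (step, m, v), where step = lr * m_hat / (sqrt(v_hat) + eps)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def adamw_step(theta, g, m, v, t, lr=0.001, lam=0.01):
    # Decoupled: decay acts on the weight directly, outside the moment estimates
    step, m, v = adam_update(g, m, v, t, lr)
    return theta - step - lr * lam * theta, m, v


def adam_l2_step(theta, g, m, v, t, lr=0.001, lam=0.01):
    # Coupled: the L2 term joins the gradient and is rescaled by sqrt(v_hat)
    step, m, v = adam_update(g + lam * theta, m, v, t, lr)
    return theta - step, m, v


theta0, g1 = 1.0, 0.1  # theta0 is a hypothetical starting weight
theta_w, _, _ = adamw_step(theta0, g1, 0.0, 0.0, t=1)
theta_l2, _, _ = adam_l2_step(theta0, g1, 0.0, 0.0, t=1)
print("AdamW:     ", theta_w)
print("Adam + L2: ", theta_l2)
```

In the coupled version the penalty is divided by the same adaptive denominator as the gradient, so on this first step it has no visible effect on the update size. In AdamW the decay term $\alpha \lambda \theta_0$ is subtracted in full.
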
8 Adam in Task Switching and Memory Retention


8.1 How Momentum Affects Learning


Momentum retains information about past gradients. In task switching, momentum can
cause the optimizer to continue updating parameters in directions relevant to the previous
task.


8.2 Examples Illustrating Memory Retention


8.2.1 Example 1: Training on Task A 


Parameters: Initial weights optimized for Task A. 
Gradients: Consistent direction for Task A. 


Momentum Accumulation: $m_t$ builds up in the direction beneficial for Task A.


8.2.2 Example 2: Switching to Task B 


Gradients Change: New gradients may point in different directions. 


Momentum Effect: The accumulated $m_t$ from Task A continues to influence
updates.


Result: Parameters adjust slowly to Task B, partially retaining Task A’s information.
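
A toy simulation shows how long the first moment keeps pointing in the old direction after a switch. Everything in this sketch is a simplifying assumption: Task A is modelled as 50 steps with gradient +1, Task B as gradient -1, and only the raw first moment $m_t$ with $\beta_1 = 0.9$ is tracked (after 50 steps the bias correction is negligible and does not change the sign).

```python
def first_moment_trace(grads, beta1=0.9):
    """Return the sequence m_1, m_2, ... for the given gradients (m_0 = 0)."""
    m, trace = 0.0, []
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        trace.append(m)
    return trace


task_a_steps = 50
grads = [1.0] * task_a_steps + [-1.0] * 20  # hypothetical Task A, then Task B
trace = first_moment_trace(grads)

# Count Task B steps until the first moment finally points in Task B's direction
steps_until_flip = next(
    i for i, m in enumerate(trace[task_a_steps:], start=1) if m < 0
)
print("m at end of Task A:", round(trace[task_a_steps - 1], 4))
print("Task B steps before m changes sign:", steps_until_flip)
```

Until the sign flips, every update still moves the parameter in the direction that was useful for Task A, even though each new gradient points the other way.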


9 Summary and Insights


• Adam Optimizer combines the benefits of momentum and adaptive learning rates.

• Mathematical Components:
    - First Moment ($m_t$): Similar to momentum, averages past gradients.
    - Second Moment ($v_t$): Tracks a running average of squared gradients (the
      uncentered variance), which scales the learning rate of each parameter.
    - Bias Correction: Ensures unbiased estimates, especially important in early
      steps.

• Backpropagation is essential for computing gradients needed by Adam.

• Examples illustrate how each component affects the optimization process.

• AdamW Variant improves upon Adam by decoupling weight decay, which can improve
  generalization.

• Task Switching Phenomenon:
    - Momentum can cause the optimizer to retain information from previous tasks.
    - Adaptive Learning Rates can slow down parameter updates for important weights
      from prior tasks.




9.1 Final Thoughts 


Understanding the Adam optimizer at a deep mathematical level reveals how its components
interact to provide efficient and robust optimization in neural networks. By
examining detailed examples, we see how the first and second moments, along with bias
correction and weight decay, contribute to the optimizer’s performance, especially in
complex scenarios like task switching.




